---
id: getting-started-mcp
title: MCP Evaluation Quickstart
sidebar_label: MCP
---

import { Timeline, TimelineItem } from "@site/src/components/Timeline";
import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";
import VideoDisplayer from "@site/src/components/VideoDisplayer";

Learn to evaluate Model Context Protocol (MCP) based applications using `deepeval`, for both single-turn and multi-turn use cases.

## Overview

MCP evaluation differs from other evaluations in that you can create either single-turn or multi-turn test cases, depending on your application's design and architecture.

**In this 10 min quickstart, you'll learn how to:**

- Track your MCP interactions
- Create test cases for your application
- Evaluate your MCP-based application using MCP metrics

## Prerequisites

- Install `deepeval`
- A Confident AI API key (recommended). Sign up for one [here](https://app.confident-ai.com)

:::info
Confident AI allows you to view and share your testing reports. Set your API key as an environment variable in your terminal:

```bash
export CONFIDENT_API_KEY="confident_us..."
```

:::

## Understanding MCP Evals

**Model Context Protocol (MCP)** is an open-source framework developed by **Anthropic** to standardize how AI systems, particularly large language models (LLMs), interact with external tools and data sources.
The MCP architecture is composed of three main components:

- **Host** — The AI application that coordinates and manages one or more MCP clients
- **Client** — Maintains a one-to-one connection with a server and retrieves context from it for the host to use
- **Server** — Paired with a single client, providing the context the client passes to the host

![MCP Architecture Image](https://deepeval-docs.s3.amazonaws.com/mcp-architecture.png)
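
The relationships between the three components can be sketched with a few plain Python classes. This is only a mental model, not part of `deepeval` or the MCP SDK: the class names `Host`, `Client`, and `Server`, their methods, and the sample server and tool names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical MCP server exposing a list of tool names."""
    name: str
    tools: list[str] = field(default_factory=list)


@dataclass
class Client:
    """Holds a one-to-one connection to a single server."""
    server: Server

    def list_tools(self) -> list[str]:
        return list(self.server.tools)


@dataclass
class Host:
    """Coordinates one or more clients, one per connected server."""
    clients: list[Client] = field(default_factory=list)

    def connect(self, server: Server) -> Client:
        client = Client(server)
        self.clients.append(client)
        return client

    def available_tools(self) -> dict[str, list[str]]:
        # The host gathers context from every client it manages.
        return {c.server.name: c.list_tools() for c in self.clients}


host = Host()
host.connect(Server("weather", ["get_forecast"]))
host.connect(Server("files", ["read_file", "list_dir"]))
print(host.available_tools())
```

Each `Client` wraps exactly one `Server`, while the `Host` may hold many clients, which mirrors the one-to-one and one-to-many relationships described above.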

`deepeval` allows you to evaluate the MCP host on various criteria, such as its primitive usage, argument generation, and task completion.

## Run Your First MCP Eval

In `deepeval`, MCP evaluations can be run using either single-turn or multi-turn test cases. In code, you track every MCP interaction while your application runs, then create a test case once execution has finished.

:::note

`deepeval` supports a wide range of LLMs that you can use as the evaluation model for your metrics.

<Tabs>

<TabItem value="openai" label="OpenAI">

```python
from deepeval.metrics import MCPUseMetric

task_completion_metric = MCPUseMetric(model="gpt-4.1")
```

</TabItem>

<TabItem value="anthropic" label="Anthropic">

```python
from deepeval.metrics import MCPUseMetric
from deepeval.models import AnthropicModel

model = AnthropicModel("claude-3-7-sonnet-latest")
task_completion_metric = MCPUseMetric(model=model)
```

</TabItem>

<TabItem value="gemini" label="Gemini">

```python
from deepeval.metrics import MCPUseMetric
from deepeval.models import GeminiModel

model = GeminiModel("gemini-2.5-flash")
task_completion_metric = MCPUseMetric(model=model)
```

</TabItem>

<TabItem value="ollama" label="Ollama">

```python
from deepeval.metrics import MCPUseMetric
from deepeval.models import OllamaModel

model = OllamaModel("deepseek-r1")
task_completion_metric = MCPUseMetric(model=model)
```

</TabItem>

<TabItem value="grok" label="Grok">

```python
from deepeval.metrics import MCPUseMetric
from deepeval.models import GrokModel

model = GrokModel("grok-4-0709")
task_completion_metric = MCPUseMetric(model=model)
```

</TabItem>

<TabItem value="azure" label="Azure OpenAI">

```python
from deepeval.metrics import MCPUseMetric
from deepeval.models import AzureOpenAIModel

model = AzureOpenAIModel(
    model_name="gpt-4.1",
    deployment_name="Test Deployment",
    azure_openai_api_key="Your Azure OpenAI API Key",
    openai_api_version="2025-01-01-preview",
    azure_endpoint="https://example-resource.azure.openai.com/",
    temperature=0
)
task_completion_metric = MCPUseMetric(model=model)
```

</TabItem>

<TabItem value="amazon-bedrock" label="Amazon Bedrock">

```python
from deepeval.metrics import MCPUseMetric
from deepeval.models import AmazonBedrockModel

model = AmazonBedrockModel(
    model_id="anthropic.claude-3-opus-20240229-v1:0",
    temperature=0
)
task_completion_metric = MCPUseMetric(model=model)
```

</TabItem>

<TabItem value="vertex-ai" label="Vertex AI">

```python
from deepeval.metrics import MCPUseMetric
from deepeval.models import GeminiModel

model = GeminiModel(
    model_name="gemini-1.5-pro",
    project="Your Project ID",
    location="us-central1",
    temperature=0
)
task_completion_metric = MCPUseMetric(model=model)
```

</TabItem>

</Tabs>
:::

<Timeline>
<TimelineItem title="Create an MCP server">

Connect your application to your MCP servers and create an `MCPServer` object for each server you're using.

```python title="main.py" showLineNumbers {4,19-23}
from contextlib import AsyncExitStack
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client
from deepeval.test_case import MCPServer

url = "https://example.com/mcp"

mcp_servers = []
tools_called = []

async def main():
    exit_stack = AsyncExitStack()  # one stack owns the transport and session; call `await exit_stack.aclose()` when done
    read, write, _ = await exit_stack.enter_async_context(streamablehttp_client(url))
    session = await exit_stack.enter_async_context(ClientSession(read, write))
    await session.initialize()

    tool_list = await session.list_tools()

    mcp_servers.append(MCPServer(
        name=url,
        transport="streamable-http",
        available_tools=tool_list.tools,
    ))
```

</TimelineItem>
<TimelineItem title="Track your MCP interactions">

In your MCP application's main file, you need to track all MCP interactions at runtime. This means appending to `tools_called`, `resources_called`, and `prompts_called` whenever your host uses a tool, resource, or prompt.

![MCP Interaction tracking](https://deepeval-docs.s3.us-east-1.amazonaws.com/docs:evaluation-mcp-tools.png)

```python title="main.py" showLineNumbers {1,21-25}
from deepeval.test_case import MCPToolCall

available_tools = [
    {"name": tool.name, "description": tool.description, "input_schema": tool.inputSchema}
    for tool in tool_list.tools
]

response = self.anthropic.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=messages,
    tools=available_tools,
)

for content in response.content:
    if content.type == "tool_use":
        tool_name = content.name
        tool_args = content.input
        result = await session.call_tool(tool_name, tool_args)

        tools_called.append(MCPToolCall(
            name=tool_name,
            args=tool_args,
            result=result
        ))
```

You can also track any [resources](https://www.deepeval.com/docs/evaluation-mcp#resources) or [prompts](https://www.deepeval.com/docs/evaluation-mcp#prompts) if you use them. With this in place, every MCP interaction is tracked while your application runs.

</TimelineItem>
<TimelineItem title="Create a test case">

You can now create a test case for your MCP application using the above interactions.

```python
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(
    input=query,
    actual_output=response,
    mcp_servers=mcp_servers,
    mcp_tools_called=tools_called,
)
```

Test cases must be created after your application has finished executing. See the [full example on how to create single-turn test cases](https://github.com/confident-ai/deepeval/blob/main/examples/mcp_evaluation/mcp_eval_single_turn.py) for MCP evaluations.

:::tip
You can make your `main()` function return `mcp_servers`, `tools_called`, `resources_called` and `prompts_called`. This helps you import your MCP application anywhere and create test cases easily in different test files.
:::
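
One way to structure this is sketched below. It is a minimal illustration that runs without `deepeval` or a live MCP server: the `TrackedRun` container, the `query` parameter, and the placeholder values are all hypothetical, and plain strings and dicts stand in for the `MCPServer` and `MCPToolCall` objects your real application would append.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class TrackedRun:
    """Hypothetical container for everything a test case needs from one run."""
    query: str
    response: str = ""
    mcp_servers: list = field(default_factory=list)
    tools_called: list = field(default_factory=list)
    resources_called: list = field(default_factory=list)
    prompts_called: list = field(default_factory=list)


async def main(query: str) -> TrackedRun:
    run = TrackedRun(query=query)
    # Placeholder for the real application: connect to servers, call the model,
    # and append MCPServer / MCPToolCall objects as shown in the earlier steps.
    run.mcp_servers.append("placeholder-server")
    run.tools_called.append({"name": "placeholder_tool", "args": {}})
    run.response = "placeholder answer"
    return run


# In a test file: run the application once, then build the test case from its result.
run = asyncio.run(main("What is the weather in Paris?"))
print(len(run.tools_called), run.response)
```

Returning the tracked lists instead of relying on module-level globals keeps each run isolated, so several test files can call `main()` without sharing state.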

</TimelineItem>
<TimelineItem title="Define metrics">

You can now use the [`MCPUseMetric`](/docs/metrics-mcp-use) to run evals on your single-turn test case.

```python
from deepeval.metrics import MCPUseMetric

mcp_use_metric = MCPUseMetric()
```

</TimelineItem>
<TimelineItem title="Run an evaluation">

Run an evaluation on the test cases you previously created using the metrics defined above.

```python
from deepeval import evaluate

evaluate([test_case], [mcp_use_metric])
```

🎉🥳 **Congratulations!** You just ran your first single-turn MCP evaluation. Here's what happened:

- When you call `evaluate()`, `deepeval` runs all your `metrics` against all `test_cases`
- All `metrics` output a score between `0` and `1`, with a `threshold` that defaults to `0.5`
- The `MCPUseMetric` first evaluates your test case on its primitive usage to see how well your application has utilized the MCP capabilities given to it.
- It then evaluates the argument correctness to see if the inputs generated for your primitive usage were correct and accurate for the task.
- Finally, the `MCPUseMetric` takes the minimum of the two scores as the final score for your test case.
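
The scoring logic described in the last few bullets can be sketched in plain Python. This illustrates the min-and-threshold idea only and is not `deepeval`'s implementation: the function name `combine_mcp_use_scores` and the sample scores are hypothetical, and treating a score equal to the threshold as a pass is an assumption.

```python
def combine_mcp_use_scores(
    primitive_usage: float,
    argument_correctness: float,
    threshold: float = 0.5,
) -> tuple[float, bool]:
    """Return (final_score, passed), where the weaker sub-score decides the result."""
    for score in (primitive_usage, argument_correctness):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
    final_score = min(primitive_usage, argument_correctness)
    # Assumption: a score equal to the threshold counts as passing.
    return final_score, final_score >= threshold


# Good primitive usage but poor arguments: the lower score drives the outcome.
final_score, passed = combine_mcp_use_scores(0.9, 0.4)
print(final_score, passed)
```

Because the minimum is used, a test case only passes when both primitive usage and argument correctness clear the threshold, and raising the threshold makes the check stricter.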

</TimelineItem>

<TimelineItem title="View on Confident AI (recommended)">

If you've set your `CONFIDENT_API_KEY`, test runs will appear automatically on [Confident AI](https://app.confident-ai.com), the DeepEval platform.

<VideoDisplayer
  src="https://deepeval-docs.s3.us-east-1.amazonaws.com/docs:getting-started-mcp-single-turn.mp4"
  confidentUrl="https://www.confident-ai.com/docs/llm-evaluation/dashboards/testing-reports"
  label="Evaluations Test Reports on Confident AI"
/>

:::tip
If you haven't logged in, you can still upload the test run to Confident AI from the local cache:

```bash
deepeval view
```

:::

</TimelineItem>

</Timeline>

## Multi-Turn MCP Evals

For multi-turn MCP evals, you must add `mcp_tools_called`, `mcp_resources_called`, and `mcp_prompts_called` (where applicable) to the `Turn` object for each assistant turn.

<Timeline>
<TimelineItem title="Track your MCP interactions">

During the interactive session of your application, you need to track all MCP interactions. This means recording the tools, resources, and prompts your host uses on the corresponding assistant turn.

![MCP Interaction tracking](https://deepeval-docs.s3.us-east-1.amazonaws.com/docs:evaluation-mcp-tools.png)

```python title="main.py" {7,13}
from deepeval.test_case import MCPToolCall, Turn

async def main():
    ...

    result = await session.call_tool(tool_name, tool_args)
    tool_called = MCPToolCall(name=tool_name, args=tool_args, result=result)

    turns.append(
        Turn(
            role="assistant",
            content=f"Tool call: {tool_name} with args {tool_args}",
            mcp_tools_called=[tool_called],
        )
    )
```

You can also track any [resources](https://www.deepeval.com/docs/evaluation-mcp#resources) or [prompts](https://www.deepeval.com/docs/evaluation-mcp#prompts) if you use them. With this in place, every MCP interaction is tracked while your application runs.

</TimelineItem>
<TimelineItem title="Create a test case">

You can now create a test case for your MCP application using the above `turns` and `mcp_servers`.

```python
from deepeval.test_case import ConversationalTestCase

convo_test_case = ConversationalTestCase(
    turns=turns,
    mcp_servers=mcp_servers
)
```

Test cases must be created after your application has finished executing. See the [full example on how to create multi-turn test cases](https://github.com/confident-ai/deepeval/blob/main/examples/mcp_evaluation/mcp_eval_multi_turn.py) for MCP evaluations.

:::tip
You can make your `main()` function return `turns` and `mcp_servers`. This helps you import your MCP application anywhere and create test cases easily in different test files.
:::

</TimelineItem>
<TimelineItem title="Define metrics">

You can now use the [MCP metrics](/docs/metrics-multi-turn-mcp-use) to run evals on your test cases. There are two metrics for multi-turn test cases that support MCP evals.

```python
from deepeval.metrics import MultiTurnMCPUseMetric, MCPTaskCompletionMetric

mcp_use_metric = MultiTurnMCPUseMetric()
mcp_task_completion = MCPTaskCompletionMetric()
```

</TimelineItem>
<TimelineItem title="Run an evaluation">

Run an evaluation on the test cases you previously created using the metrics defined above.

```python
from deepeval import evaluate

evaluate([convo_test_case], [mcp_use_metric, mcp_task_completion])
```

🎉🥳 **Congratulations!** You just ran your first multi-turn MCP evaluation. Here's what happened:

- When you call `evaluate()`, `deepeval` runs all your `metrics` against all `test_cases`
- All `metrics` output a score between `0` and `1`, with a `threshold` that defaults to `0.5`
- You used the `MultiTurnMCPUseMetric` and `MCPTaskCompletionMetric` for testing your MCP application
- The `MultiTurnMCPUseMetric` evaluates your application's capability on primitive usage and argument generation to get the final score.
- The `MCPTaskCompletionMetric` evaluates whether your application has satisfied the given task for all the interactions between user and assistant.

</TimelineItem>
<TimelineItem title="View on Confident AI (recommended)">

If you've set your `CONFIDENT_API_KEY`, test runs will appear automatically on [Confident AI](https://app.confident-ai.com), the DeepEval platform.

<VideoDisplayer
  src="https://deepeval-docs.s3.us-east-1.amazonaws.com/docs:getting-started-mcp-multi-turn.mp4"
  confidentUrl="https://www.confident-ai.com/docs/llm-evaluation/multi-turn/end-to-end"
  label="Multi-Turn End-to-End Evals"
/>

:::tip
If you haven't logged in, you can still upload the test run to Confident AI from the local cache:

```bash
deepeval view
```

:::

</TimelineItem>
</Timeline>

## Next Steps

Now that you have run your first MCP eval, you should:

1. **Customize your metrics**: You can raise the threshold of your metrics to make them stricter for your use case.
2. **Prepare a dataset**: If you don't have one, [generate one](/docs/synthesizer-introduction) as a starting point to store your inputs as goldens.
3. **Set up tracing**: If you created your own custom MCP server, you can [set up tracing](https://documentation.confident-ai.com/docs/llm-tracing/tracing-features/span-types) on your tool definitions.

<VideoDisplayer
  src="https://confident-docs.s3.us-east-1.amazonaws.com/llm-tracing:spans.mp4"
  confidentUrl="/docs/llm-tracing/introduction"
  label="Span-Level Evals in Production"
/>

You can [learn more about MCP here](/docs/evaluation-mcp).
